COMETS Analytics support all cohort-specific analyses of the COMETS consortium. This collaborative work is done via the COMETS harmonization group activities. For more information, see the [COMETS website] (http://epi.grants.cancer.gov/comets/). This vignette demonstrates using the COMETS R package from the command line. See the tutorial for using COMETS from the GUI. Documentation of the COMETS R package can be found here manual.
The required input file should be in excel format with the following 6 sheets:
Complete documentation of the various sheets can be found in the package documentation. An example input file is available HERE.
Empty cells in any excel sheet become missing values when the R software reads the sheet into a data frame. Missing values are also assigned to non-numeric values for continuous variables, which are the metabolite variables and variables in the VarMap sheet with VARTYPE = continuous.
The first step of the analysis is to load in the data with the readCOMETSinput() function.
Input for this function is an Excel spreadsheet, per the description above.
# Retrieve the full path of the input data
dir <- system.file("extdata", package="COMETS", mustWork=TRUE)
csvfile <- file.path(dir, "cometsInputAge.xlsx")
# Read in and process the input data
exmetabdata <- COMETS::readCOMETSinput(csvfile)
## [1] "Metabolites sheet is read in"
## [1] "SubjectMetabolites sheet is read in"
## [1] "SubjectData sheet is read in"
## [1] "VarMap sheet is read in"
## [1] "Models sheet is read in"
## [1] "ModelOptions sheet is read in"
## [1] "There are 14 categorical variables"
## [1] "There are 2 continuous variables"
## [1] "Running Integrity Check..."
## Joining, by = "id"
## Joining, by = "hmdb_id"
## [1] "Input data has passed QC (metabolite and sample names match in all input files)"
To plot some the distribution of variances for each metabolite:
COMETS::plotVar(exmetabdata,titlesize=12)
## Warning: The titlefont attribute is deprecated. Use title = list(font = ...)
## instead.
To plot the distribution of minimum/missing values:
COMETS::plotMinvalues(exmetabdata,titlesize=12)
## Warning: The titlefont attribute is deprecated. Use title = list(font = ...)
## instead.
There are 2 ways to specify your model, batch or interactive. In Batch mode, models are specified in your input file Models sheet. The model information needs to be read in with the function getModelData() and processed so the software knows which models to run. The following call defines the “1 Gender adjusted” model from the Models sheet in the input file to be run.
exmodeldata <- COMETS::getModelData(exmetabdata,modlabel="1 Gender adjusted")
In Interactive mode, models are specified as parameters. The model information needs to be read in with the function getModelData() and processed so the software knows which models to run.
The following call defines the model with age and bmi_grp as the exposure variables, and includes only the subjects with age > 40 and bmi_grp > 2.
exmodeldata2 <- COMETS::getModelData(exmetabdata, modelspec="Interactive",
exposures=c("age","bmi_grp"), where=c("age>40","bmi_grp>2"))
## [1] "Filtering subjects according to the rule(s) age>40 & bmi_grp>2 . 279 of 1000 are retained."
The runModel() function is the main function for running a single model, and by default, a correlation analysis is performed.
excorrdata <- COMETS::runModel(exmodeldata2,exmetabdata,"DPP")
The output of the correlation analysis can then be compiled and output to an Excel file with the following function:
COMETS::OutputListToExcel(filename="DPP_corr.xlsx", excorrdata)
To view the first 3 lines of the correlation analysis output, simply type:
COMETS::showModel(excorrdata,nlines=3)
##
## ModelSummary:
## run cohort spec model outcomespec exposurespec nobs
## 1 1 DPP Interactive _1_2_3_benzenetriol_sulfate_2 age 279
## 2 2 DPP Interactive _1_2_dipalmitoylglycerol age 279
## 3 3 DPP Interactive _1_2_propanediol age 279
## message adjvars adjvars.removed adjspec outcome_uid
## 1 CHEM100006374
## 2 HMDB07098
## 3 HMDB01881
## outcome exposure_uid adj_uid
## 1 1,2,3-benzenetriol sulfate (2) age
## 2 DG(16:0/16:0/0:0) age
## 3 Propylene glycol age
##
## Effects:
## run outcomespec exposurespec term corr p.value
## 1 1 _1_2_3_benzenetriol_sulfate_2 age age 0.164624501 0.005846722
## 2 2 _1_2_dipalmitoylglycerol age age 0.068903188 0.251337451
## 3 3 _1_2_propanediol age age 0.001667259 0.977882521
## NULL
To display the heatmap of the resulting correlation matrix, use the showheatmap function.
COMETS::showHeatmap(excorrdata)
To display the hierarchical clustering of the resulting correlation matrix, use the showHClust function. This diplay requires at least 2 rows and 2 columns in the correlation matrix.
exmodeldata<-COMETS::getModelData(exmetabdata,modelspec = "Interactive",exposures = c("bmi_grp","age"))
excorrdata <- COMETS::runModel(exmodeldata,exmetabdata,"DPP")
COMETS::showHClust(excorrdata, showticklabels=FALSE)
Results can be written to an output Excel file with the following command:
COMETS::OutputListToExcel("Model1.xlsx", excorrdata)
A stratified correlation analysis can be performed by specifiying stratification variables in the call to getModelData(). If more than one stratification variable is specified, then the strata will be defined by all unique combinations of the stratification variables. The following call will define a model stratified by race_grp.
exmodeldata2 <- COMETS::getModelData(exmetabdata,modelspec="Interactive",
outcomes=c("lactose","lactate"),
exposures=c("age","bmi_grp"),strvars="race_grp")
The stratified correlation analysis is run by calling the runModel() function.
excorrdata2 <- COMETS::runModel(exmodeldata2,exmetabdata,"DPP")
Call getModelData() to define a model which adjusts for age group, has lactose and lactate as outcome variables, and has age and bmi group as the exposure variables.
exmodeldata <- COMETS::getModelData(exmetabdata,modelspec="Interactive", adjvars="age_grp",
outcomes=c("lactose","lactate"), exposures=c("age","bmi_grp"))
To run a linear regression using the lm function, a list of options must be passed into runModel() with the model option set to “lm”.
lm_results <- COMETS::runModel(exmodeldata, exmetabdata, "DPP", op=list(model="lm"))
print(lm_results)
## $ModelSummary
## run cohort spec model outcomespec exposurespec wald.pvalue r.squared
## 1 1 DPP Interactive lactose age 0.16451267 0.01467966
## 2 2 DPP Interactive lactate age 0.60149903 0.00165591
## 3 3 DPP Interactive lactose bmi_grp 0.02387982 0.02206445
## 4 4 DPP Interactive lactate bmi_grp 0.00166282 0.01641147
## adj.r.squared sigma loglik aic bic deviance df.residual
## 1 0.01171183 1.8029707 -2006.3702 4022.7404 4047.2792 3237.70036 996
## 2 -0.00135115 0.2882131 -172.8794 355.7588 380.2975 82.73452 996
## 3 0.01714526 1.7980076 -2002.6087 4019.2173 4053.5716 3213.43441 994
## 4 0.01146384 0.2863629 -165.4342 344.8684 379.2227 81.51171 994
## nobs message adjvars adjvars.removed adjspec outcome_uid
## 1 1000 age_grp.2;age_grp.3 age_grp HMDB00186
## 2 1000 age_grp.2;age_grp.3 age_grp HMDB00190
## 3 1000 age_grp.2;age_grp.3 age_grp HMDB00186
## 4 1000 age_grp.2;age_grp.3 age_grp HMDB00190
## outcome exposure_uid adj_uid
## 1 Alpha-Lactose age age_grp.2;age_grp.3
## 2 L-Lactic acid age age_grp.2;age_grp.3
## 3 Alpha-Lactose bmi_grp age_grp.2;age_grp.3
## 4 L-Lactic acid bmi_grp age_grp.2;age_grp.3
##
## $Effects
## run outcomespec exposurespec term estimate std.error statistic
## 1 1 lactose age age 0.034697534 0.024961296 1.3900534
## 2 2 lactate age age -0.002083854 0.003990177 -0.5222460
## 3 3 lactose bmi_grp bmi_grp.2 -0.047905160 0.134163189 -0.3570663
## 4 3 lactose bmi_grp bmi_grp.3 0.360610471 0.145954081 2.4707118
## 5 3 lactose bmi_grp bmi_grp.4 0.420224133 0.577141138 0.7281133
## 6 4 lactate bmi_grp bmi_grp.2 0.034207637 0.021367743 1.6009008
## 7 4 lactate bmi_grp bmi_grp.3 0.080743228 0.023245640 3.4734783
## 8 4 lactate bmi_grp bmi_grp.4 -0.126241151 0.091919426 -1.3733892
## p.value
## 1 0.1648232433
## 2 0.6016151626
## 3 0.7211179261
## 4 0.0136512756
## 5 0.4667157218
## 6 0.1097166272
## 7 0.0005359045
## 8 0.1699410577
##
## attr(,"ptime")
## [1] "Processing time: 0.14 sec"
Run a linear regression using the glm function for the same variables as above. The default family used with glm is “gaussian”, which corresponds to a linear regression. The Effects data frame will be the same as with lm, but the ModelSummary data frame will contain some different columns.
glm_results <- COMETS::runModel(exmodeldata, exmetabdata, "DPP", op=list(model="glm"))
print(all.equal(lm_results$Effects, glm_results$Effects))
## [1] TRUE
print(glm_results$ModelSummary)
## run cohort spec model outcomespec exposurespec converged wald.pvalue
## 1 1 DPP Interactive lactose age 1 0.16451267
## 2 2 DPP Interactive lactate age 1 0.60149903
## 3 3 DPP Interactive lactose bmi_grp 1 0.02387982
## 4 4 DPP Interactive lactate bmi_grp 1 0.00166282
## null.deviance df.null loglik aic bic deviance df.residual
## 1 3285.93681 999 -2006.3702 4022.7404 4047.2792 3237.70036 996
## 2 82.87175 999 -172.8794 355.7588 380.2975 82.73452 996
## 3 3285.93681 999 -2002.6087 4019.2173 4053.5716 3213.43441 994
## 4 82.87175 999 -165.4342 344.8684 379.2227 81.51171 994
## nobs message adjvars adjvars.removed adjspec outcome_uid
## 1 1000 age_grp.2;age_grp.3 age_grp HMDB00186
## 2 1000 age_grp.2;age_grp.3 age_grp HMDB00190
## 3 1000 age_grp.2;age_grp.3 age_grp HMDB00186
## 4 1000 age_grp.2;age_grp.3 age_grp HMDB00190
## outcome exposure_uid adj_uid
## 1 Alpha-Lactose age age_grp.2;age_grp.3
## 2 L-Lactic acid age age_grp.2;age_grp.3
## 3 Alpha-Lactose bmi_grp age_grp.2;age_grp.3
## 4 L-Lactic acid bmi_grp age_grp.2;age_grp.3
Call getModelData() to define a model which adjusts for age group, has nested_case as the outcome variable, and has lactose and lactate as the exposure variables. The variable nested_case must be a binary 0-1 variable.
exmodeldata <- COMETS::getModelData(exmetabdata,modelspec="Interactive", adjvars="age_grp",
outcomes="nested_case", exposures=c("lactose","lactate"))
To run a logistic regression, the list of options op must also include a model.options list with family set to “binomial”.
op <- list(model="glm", model.options=list(family="binomial"))
glm_results <- COMETS::runModel(exmodeldata, exmetabdata, "DPP", op=op)
print(glm_results)
## $ModelSummary
## run cohort spec model outcomespec exposurespec converged wald.pvalue
## 1 1 DPP Interactive nested_case lactose 1 0.51856182
## 2 2 DPP Interactive nested_case lactate 1 0.01003264
## null.deviance df.null loglik aic bic deviance df.residual nobs
## 1 1386.278 999 -692.5559 1393.112 1412.743 1385.112 996 1000
## 2 1386.278 999 -689.4160 1386.832 1406.463 1378.832 996 1000
## message adjvars adjvars.removed adjspec outcome_uid outcome
## 1 age_grp.2;age_grp.3 age_grp nested_case nested_case
## 2 age_grp.2;age_grp.3 age_grp nested_case nested_case
## exposure_uid adj_uid
## 1 HMDB00186 age_grp.2;age_grp.3
## 2 HMDB00190 age_grp.2;age_grp.3
##
## $Effects
## run outcomespec exposurespec term estimate std.error statistic
## 1 1 nested_case lactose lactose 0.02268456 0.03513914 0.6455639
## 2 2 nested_case lactate lactate 0.57214513 0.22221799 2.5747022
## p.value
## 1 0.51856182
## 2 0.01003264
##
## attr(,"ptime")
## [1] "Processing time: 0.1 sec"
Call getModelData() to define a model which adjusts for age group, has n_visits as the outcome variable, and has lactose and lactate as the exposure variables. The variable n_visits must be a non-negative integer valued variable.
exmodeldata <- COMETS::getModelData(exmetabdata,modelspec="Interactive", adjvars="age_grp",
outcomes="n_visits", exposures=c("lactose","lactate"))
To run a Poisson regression, the list of options op must also include a model.options list with family set to “poisson”.
op <- list(model="glm", model.options=list(family="poisson"))
poisson_results <- COMETS::runModel(exmodeldata, exmetabdata, "DPP", op=op)
print(poisson_results)
## $ModelSummary
## run cohort spec model outcomespec exposurespec converged wald.pvalue
## 1 1 DPP Interactive n_visits lactose 1 0.6105400
## 2 2 DPP Interactive n_visits lactate 1 0.8271571
## null.deviance df.null loglik aic bic deviance df.residual nobs
## 1 632.649 999 -1538.073 3084.146 3103.777 630.4096 996 1000
## 2 632.649 999 -1538.179 3084.357 3103.988 630.6207 996 1000
## message adjvars adjvars.removed adjspec outcome_uid outcome
## 1 age_grp.2;age_grp.3 age_grp n_visits n_visits
## 2 age_grp.2;age_grp.3 age_grp n_visits n_visits
## exposure_uid adj_uid
## 1 HMDB00186 age_grp.2;age_grp.3
## 2 HMDB00190 age_grp.2;age_grp.3
##
## $Effects
## run outcomespec exposurespec term estimate std.error statistic
## 1 1 n_visits lactose lactose 0.00624383 0.01225956 0.5093028
## 2 2 n_visits lactate lactate 0.01679890 0.07693599 0.2183491
## p.value
## 1 0.6105400
## 2 0.8271571
##
## attr(,"ptime")
## [1] "Processing time: 0.1 sec"
Below is example code to compute 95% confidence intervals for the estimates from the glm results above in the Effects data frame. Confidence intervals for the lm results are similar.
x <- glm_results$Effects
z <- qnorm((1 - 0.95)/2, lower.tail=FALSE)
x[, "estimate.lower"] <- x[, "estimate"] - z*x[, "std.error"]
x[, "estimate.upper"] <- x[, "estimate"] + z*x[, "std.error"]
print(x)
## run outcomespec exposurespec term estimate std.error statistic
## 1 1 nested_case lactose lactose 0.02268456 0.03513914 0.6455639
## 2 2 nested_case lactate lactate 0.57214513 0.22221799 2.5747022
## p.value estimate.lower estimate.upper
## 1 0.51856182 -0.04618689 0.09155601
## 2 0.01003264 0.13660588 1.00768438
All models desginated in the input file can be run with one command, and individual output Excel files or correlation results will be written in the current directory by default. The function returns a list of objects.
allresults <- COMETS::runAllModels(exmetabdata,writeTofile=F)
sessionInfo()
## R version 4.0.2 (2020-06-22)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 10 x64 (build 18363)
##
## Matrix products: default
##
## locale:
## [1] LC_COLLATE=English_United States.1252
## [2] LC_CTYPE=English_United States.1252
## [3] LC_MONETARY=English_United States.1252
## [4] LC_NUMERIC=C
## [5] LC_TIME=English_United States.1252
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## loaded via a namespace (and not attached):
## [1] nlme_3.1-148 fs_1.5.0 usethis_2.0.0
## [4] lubridate_1.7.9.2 devtools_2.3.2 webshot_0.5.2
## [7] RColorBrewer_1.1-2 httr_1.4.2 rprojroot_2.0.2
## [10] backports_1.2.0 tools_4.0.2 ISwR_2.0-8
## [13] R6_2.5.0 rpart_4.1-15 COMETS_1.5.0.0
## [16] lazyeval_0.2.2 colorspace_2.0-0 nnet_7.3-14
## [19] withr_2.4.1 mnormt_2.0.2 tidyselect_1.1.0
## [22] gridExtra_2.3 prettyunits_1.1.1 processx_3.4.5
## [25] compiler_4.0.2 cli_2.2.0 TSP_1.1-10
## [28] desc_1.2.0 plotly_4.9.3 labeling_0.4.2
## [31] scales_1.1.1 psych_2.0.12 callr_3.5.1
## [34] stringr_1.4.0 digest_0.6.27 rmarkdown_2.6
## [37] pkgconfig_2.0.3 htmltools_0.5.1.1 sessioninfo_1.1.1
## [40] fastmap_1.1.0 readxl_1.3.1 htmlwidgets_1.5.3
## [43] rlang_0.4.10 rstudioapi_0.13 farver_2.0.3
## [46] generics_0.1.0 jsonlite_1.7.2 crosstalk_1.1.1
## [49] dendextend_1.14.0 dplyr_1.0.3 ModelMetrics_1.2.2.2
## [52] magrittr_2.0.1 Matrix_1.2-18 Rcpp_1.0.6
## [55] munsell_0.5.0 fansi_0.4.2 viridis_0.5.1
## [58] lifecycle_0.2.0 stringi_1.5.3 pROC_1.17.0.1
## [61] yaml_2.2.1 MASS_7.3-51.6 pkgbuild_1.2.0
## [64] plyr_1.8.6 recipes_0.1.15 grid_4.0.2
## [67] parallel_4.0.2 crayon_1.3.4 lattice_0.20-41
## [70] splines_4.0.2 tmvnsim_1.0-2 knitr_1.31
## [73] ps_1.5.0 pillar_1.4.7 corpcor_1.6.9
## [76] reshape2_1.4.4 codetools_0.2-16 stats4_4.0.2
## [79] pkgload_1.1.0 glue_1.4.2 evaluate_0.14
## [82] data.table_1.13.6 remotes_2.2.0 vctrs_0.3.6
## [85] foreach_1.5.1 cellranger_1.1.0 testthat_3.0.1
## [88] gtable_0.3.0 purrr_0.3.4 tidyr_1.1.2
## [91] heatmaply_1.1.1 assertthat_0.2.1 cachem_1.0.1
## [94] ggplot2_3.3.3 xfun_0.20 gower_0.2.2
## [97] prodlim_2019.11.13 broom_0.7.3 class_7.3-17
## [100] survival_3.1-12 viridisLite_0.3.0 timeDate_3043.102
## [103] seriation_1.2-9 tibble_3.0.5 iterators_1.0.13
## [106] registry_0.5-1 memoise_2.0.0 lava_1.6.8.1
## [109] ellipsis_0.3.1 caret_6.0-86 subselect_0.15.2
## [112] ipred_0.9-9